Abstract


Experimental data on transient faults from several digital computer systems are presented and analyzed. This research is significant because earlier work on validation of reliability models has concentrated only on permanent faults. The systems for which data have been collected are the DEC PDP-10 series computers, the Cm* multiprocessor, and the C.vmp fault tolerant microprocessor. Current results show that transient faults do not occur with constant failure rates as has been commonly assumed. Instead, the data for all three systems indicate Weibull distributions with decreasing failure rates.
F336 15-78-0-1551; by the Office of Naval Research under Contract NR-^48-645) and by Digital Equipment Corporation. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either
1. Introduction

The problems posed by transient faults have largely been neglected in the research literature, except for occasional items of limited scope [WaKe75]. A general model for transient fault occurrence and recovery based on a Markov model with the assumed exponential distribution for interarrival times was presented in [Aviz77]. This approach is typical of efforts to model all types of faults: permanent [Oepa74] and intermittent [Spii77, Su78] as well as transient. One goal of the research reported here was to test the assumed exponential distribution for transient faults.

1.1 Definitions

The  following  terms  need  to  be  precisely  defined: 

Fault          erroneous state of hardware due either to failures of components or to physical interference from the environment;

Error          manifestation of a fault within a program or data structure;

Permanent      fault or error which is continuous and stable, reflecting an irreversible physical change in the hardware;

Intermittent   fault or error which is only occasionally detectable, due to unstable physical failures or environmental conditions;

Transient      fault or error due to temporary environmental conditions.

The distinction between intermittent and transient faults is not always made in the literature [Kama75, Tasa77]. The dividing line chosen for this study is that of the applicability of testing [Breu73, Kama74, Losq78, Savi78]. Faults due to underlying physical conditions (permanent and intermittent) of the hardware or unstable but repeated environmental conditions are at least potentially detectable by testing, and then repairable. However, faults due to temporary environmental conditions are incapable of repair, as the hardware is physically undamaged. It is this attribute of transient faults which magnifies their importance. Even in the absence of all physical defects, including those manifested as intermittent errors, faults will still occur. Fault tolerant techniques will still be required to prevent such errors from causing computer systems to crash.

1.2  Significance  of  Transient  Errors 


The importance of transient faults comes from two factors: their relatively frequent occurrence and the impossibility of preventative maintenance. Several studies have shown that permanent faults cause only a small fraction of all detected errors. The available figures for several systems are given in Table 1-1 below [Full78, Morg78, Morg78a, Siew78]. The figures in the last row are estimates comparing the hard and soft failure rates for a one megaword by 37 bit memory array composed of 4K MOS RAMs [Gleil79, 0hm79]. (Soft failures are transient failures caused by the radioactivity in the packaging of the chips themselves.)

                                 Detection     MTBE per      MTTF per
System        Technology         Mechanism     Processor     Processor        MTBE/MTTF

CMUA          PDP-10, ECL        Parity        44 hrs.       800-1600 hrs.    0.03-0.06
Cm*           LSI-11, NMOS       Diagnostics   128 hrs.      4200 hrs.        0.03
C.vmp         TMR LSI-11         Crash         97-328 hrs.   4900 hrs.        0.02-0.07
Telettra      TTL                Mismatch      80-170 hrs.   1300 hrs.        0.01-0.13
1M x 37 RAM   MOS                (Parity)      106 hrs.      1450 hrs.        0.07

Table 1-1: Ratios of Transient to Permanent Errors

1.3  Overview  of  Paper 

Section 2 describes the three different architectures for which data has been gathered, and gives details about the software systems used on two of them for semi-automated data collection. The data analysis in Section 3 shows how a decreasing failure rate Weibull process fits the data better than a Poisson process. General conclusions and areas for further research are detailed in Section 4.
2. Data Collection for Transient Errors

2.1  Description  of  Systems 

Short  descriptions  of  the  three  architectures  for  which  data  has  been  gathered  are  given 
to  show  the  wide  range  of  system  types  which  all  support  the  same  general  conclusion  of 
decreasing  failure  rates. 


2.1.1  Description  of  PDP-10  Systems 

The PDP-10¹ is a general purpose 36-bit computer packaged either as a DECsystem-10 running a time sharing operating system called TOPS-10, or as a DECsystem-20 running TOPS-20 [Bell78]. The main system for the Department of Computer Science at Carnegie-Mellon University is a DECsystem-10 known as CMUA, with a KL-10 (ECL) processor, one megaword of memory, eight disk drives totalling 1600 megabytes of online storage, and two magnetic tape drives. In addition, the system is connected to a PDP-11 front end processor which multiplexes a large number of video terminals. This system is used almost exclusively to support the research programs of the department.

Other  systems  for  which  data  has  been  collected  in  this  study  are  the  three 
DECsystem-20's  operated  by  the  university’s  Computation  Center.  These  systems,  known  as 
TOPSA,  TOPSB,  and  TOPSC,  are  used  to  support  general  educational  and  administrative  needs 
of  the  university.  TOPSA,  used  primarily  for  administrative  work,  has  256K  of  memory,  four 
disk  drives  totalling  over  700  megabytes  of  online  storage,  and  two  magnetic  tape  drives. 
TOPSB  and  TOPSC,  used  for  educational  support,  each  have  512K  of  memory,  three  disk 
drives  with  528  megabytes  of  online  storage,  and  two  magnetic  tape  drives.  Each  of  these 
three  systems  also  connects  to  a  number  of  terminals. 


2.1.2  Description  of  Cm* 

The structure of Cm* is shown in Figure 2-1. This is a structure with a low concurrency switch (the network of buses) giving access to a shared memory. The structure is built from Processor-Memory pairs called Computer Modules or Cm's. The memory local to a processor is also the shared memory of the system. Each Cm has a local switch, or Slocal, which interfaces its bus to the rest of the system. A mapping controller, or Kmap, is shared by several Cm's, which are connected to it by their Slocal's to form a "cluster". Each Kmap in the system is connected via multiple intercluster buses to other Kmaps, completing the interconnection scheme. The Kmaps perform all the functions necessary for meeting both intracluster and intercluster memory requests. With this structure, any processor can access any memory in the system, with increasing time penalty for the "distance" away: its own local memory, memory belonging to another processor in the same cluster, or memory within another cluster.

¹DEC, PDP-10, DECsystem-10, TOPS-10, KL-10, DECsystem-20, TOPS-20, PDP-11, and LSI-11 are all registered trademarks of Digital Equipment Corporation.


[Figure: Cm* structure, with labels Intercluster Bus A and Intercluster Bus B]


Figure 2-2 shows the details of a Computer Module. The processor is a DEC LSI-11. Each Cm has dynamic RAM memory and individual Cm's have a variety of I/O devices. Several Cm's have serial links to the Cm* host computer, a message switching PDP-11 [Scel77], for facilitating user access and program loading. Other devices include disk drives and line time clocks.

The original configuration of Cm*, shown in Figure 2-3, contained ten Cm's connected to three Kmaps to form three clusters, two with four Cm's and one with two. In this setup, referred to as Cm*/10, each Cm had a serial link to the Cm* host. This greatly facilitated the initial system debugging and software development efforts.


Figure 2-2: Cm* Computer Module


[Figure: Cm* configuration, with label Intercluster Buses]


The architecture of Cm* has been widely reported in the literature. See [Swan77, Swan78] for detailed descriptions and additional references to published reports.

2.1.3  Description  of  C.vmp 

C.vmp is a triplicated NMOS LSI-11 microprocessor with voting at the bus level [McCo78, Siew77, Siew78, Siew78a]. There are three processor-memory pairs, each pair connected via a voter circuit shown in Figure 2-4. With the voter active, the three buses are voted upon and the result of the vote is sent out. Any disagreements among the processors will, therefore, not propagate to the memories and vice versa. If the nonredundant portion of the voter represents a system reliability "bottleneck", triplicated voters can be used. With this scheme even the voter can have either a transient or a hard failure and the computer will remain operational. Note that voting is done in parallel on a bit by bit basis. A computer can have a failure on a certain bit in one bus, and proper operation will continue, provided that the other two buses have the correct information for that bit. There are cases, therefore, where failures in all three buses can occur simultaneously and the computer would still be functioning correctly.
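The bit by bit voting described above can be illustrated with a small sketch. This is not the C.vmp voter circuit itself; it only assumes that each bus value can be modelled as an integer word, and the function name `vote` and the 16-bit sample values are hypothetical.

```python
def vote(bus_a: int, bus_b: int, bus_c: int) -> int:
    """Return the bit-by-bit majority of three bus words."""
    return (bus_a & bus_b) | (bus_a & bus_c) | (bus_b & bus_c)


# Hypothetical 16-bit word that all three buses should carry.
correct = 0b1011_0010_1100_0101

# Each bus has a single-bit failure, but on a different bit position.
bus_a = correct ^ 0b0000_0000_0000_0001
bus_b = correct ^ 0b0000_0001_0000_0000
bus_c = correct ^ 0b1000_0000_0000_0000

voted = vote(bus_a, bus_b, bus_c)
print(voted == correct)
```

Because the majority is taken independently for every bit, the three simultaneous failures are masked as long as no two buses are wrong on the same bit.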




users. All crashes not traced to hardware failures, software bugs, or operator mistakes were assumed to result from transient faults.


Figure  2-5:  C.vmp  Configuration 


2.2 SEADS: Data Collection on the PDP-10


2.2.1  System  Error  Files 

The core of the PDP-10 error reporting system is the online error log file maintained by the TOPS-10 and TOPS-20 operating systems. Entries are made in this file for a variety of reasons, most notably system reloads and memory and I/O errors [Digi78]. Each entry contains the date and time at which it was made, the processor serial number, and information about the type of error or other condition being reported.


2.2.2  SEADS  Program 

To facilitate statistical analysis of transient errors on PDP-10's, a program named SEADS (Statistical Error Analysis Data Summary) has been written which derives interarrival times and time-of-day distributions from the system error log files. The outputs generated include the following:

-  Lower  bound  estimates  of  system  availabilities,  in  total  and  for  each  file 
processed; 

-  Graphs  of  the  time-of-day  distribution  of  entries,  divided  into  48  half-hour 
segments; 

-  Graphs  of  the  distributions  of  interarrival  times  for  all  entries  in  total,  for  each 
entry  individually,  and  for  arbitrary  sets  of  entries; 

-  Data  files  containing  the  time-of-day  distribution  and  the  lists  of  interarrival 
times  and  error  types. 
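Two of the outputs listed above can be sketched briefly. This is an illustrative reconstruction, not the SEADS code: it assumes the log entries have already been parsed into `datetime` values, and the function names and the three sample timestamps are hypothetical.

```python
from datetime import datetime


def time_of_day_histogram(stamps):
    """Count entries in 48 half-hour segments covering 00:00 to 23:30."""
    bins = [0] * 48
    for ts in stamps:
        bins[ts.hour * 2 + ts.minute // 30] += 1
    return bins


def interarrival_hours(stamps):
    """Return the gaps, in hours, between consecutive entries."""
    ordered = sorted(stamps)
    return [(b - a).total_seconds() / 3600.0
            for a, b in zip(ordered, ordered[1:])]


# Hypothetical parsed log entries.
entries = [
    datetime(1979, 2, 17, 5, 3, 11),
    datetime(1979, 2, 17, 5, 40, 0),
    datetime(1979, 2, 18, 23, 45, 0),
]

histogram = time_of_day_histogram(entries)
gaps = interarrival_hours(entries)
print(histogram[10], histogram[11], histogram[47])
print(gaps)
```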

Examples of the first three types of outputs are shown in Figures 2-6, 2-7, and 2-8. One feature of Figure 2-8 should be explained. As the data must be scaled to fit on a fixed size histogram, some way of displaying data points which are too small to be shown in scale must be developed. The convention adopted was to use two different characters: 'X' for full scale values, and 'o' for values too small to otherwise be displayed. Please note that while Figure 2-6 is only a sample, the other two figures represent the sum total of all files processed from all four systems, comprising over 8000 hours.

To  facilitate  data  collection,  SEADS  can  read  up  to  100  separate  system  error  log  files.  It 
can  also  read  the  data  files  it  generates  to  restore  the  internal  program  data  state,  thus 
minimizing  the  computational  requirements  for  analyzing  large  numbers  of  files. 

2.3 AUTODIAGNOSTICS: Data Collection on Cm*

2.3.1  Motivation  of  AUTODIAGNOSTICS 

Cm* is an experimental system in which the hardware reliability is always suspect. Those working on software development would benefit from knowing the current hardware status. This is the main purpose of the diagnostic software on Cm*. The test programs can also detect transient errors, as was proposed by [Tasa77] and demonstrated by the AUTODIAGNOSTICS system. Furthermore, the results from this study can also help Cm* software researchers to judge the effectiveness of error recovery routines used for achieving high availability.


2.3.2  Error  Reporting  System 
2.3.2.1 CMDIAG

Automatic  diagnostic  software  was  developed  to  exercise  idle  modules  in  Cm*.  One  such 
module  was  used  to  run  the  master  control  program  CMDIAG.  This  module  operated  in  a 
special  mode  which  enabled  it  to  control  other  Cm's  as  though  it  was  a  user  at  a  terminal. 
With  this  ability,  CMDIAG  acquired  control  of  all  unassigned  modules  and  continuously  cycled 
each  of  them  through  a  series  of  four  diagnostic  tests.  The  program  was  able  to  dynamically 
acquire  and  release  modules  in  response  to  the  changing  needs  of  the  other  users. 

The test programs were designed to halt processor execution whenever an error was detected. CMDIAG detected this situation and interrogated the affected Cm's memory and registers for pertinent information, which was then printed on the Cm* host console terminal:

Location of error       the module that detected the error, the test that was running in that Cm, and the number of passes completed since the test program was loaded into the module.

Time information        the time of day, the total number of module-hours accumulated up to that time, the module-hours accumulated by that particular Cm, and the module-hours accumulated by that test.

Symptoms of the fault   error number, the ID number of the subtest, and the PC and SP values.


2.3.2.2  Test  Programs 

The sequence of test programs consists of four diagnostic programs. Each is designed to test a certain portion of the Cm* hardware. They are primarily intended to detect hard failures (permanent or intermittent faults), but can be used to detect transient faults as well.

Instruction Test (DVKAA)

a processor exerciser which tests for correct operation of all the basic instructions (excluding traps) and all the addressing modes [Digi75a]. Less than one-tenth of a second is required for each successful pass through this test; therefore many hundreds of passes are run before loading the next test program.

Interrupt and Trap Test (DVKAD)

a processor exerciser which tests for correct operation of all trap and breakpoint instructions, and for interrupt and nonexistent memory traps as well [Digi75b]. Each pass through this test takes a few seconds to complete.

Memory Diagnostic (DZKMA)

a memory exerciser for 4K to 128K of dynamic MOS RAM on LSI-11's which consists of a series of 13 test routines [Digi75c]. Each pass through this program requires several minutes, depending on the size of the memory being tested.

Slocal Test

a locally written exhaustive test of the Slocal hardware which also tests part of the Kmap hardware by doing a small memory test in "mapped" mode. This program completes a pass within a few minutes.


SEADS VERSION 3fl(ie0) ERROR FILE ANALYSIS

COUNT OF BAD TIME ERRORS: 0

TOTAL NUMBER OF ENTRIES FOR ALL INPUT FILES: 16445
TIME SPAN: 1542 HRS., FROM: 17-Feb-79 5:03:11 TO: 19-May-79 11:31:59
APPROXIMATE SYSTEM AVAILABILITY: 0.877

SYSTEM #2149 NUMBER OF ENTRIES: 344
TIME SPAN: 170 HRS., FROM: 17-Feb-79 5:03:11 TO: 24-Feb-79 7:39:09
APPROXIMATE SYSTEM AVAILABILITY: 0.907

SYSTEM #2227 NUMBER OF ENTRIES: 2045
TIME SPAN: 150 HRS., FROM: 24-Feb-79 22:22:00 TO: 3-Mar-79 5:99:59
APPROXIMATE SYSTEM AVAILABILITY: 0.947

SYSTEM #2326 NUMBER OF ENTRIES: 1149
TIME SPAN: 140 HRS., FROM: 3-Mar-79 5:43:04 TO: 9-Mar-79 1:55:27
APPROXIMATE SYSTEM AVAILABILITY: 0.094

SYSTEM #1009 NUMBER OF ENTRIES: 12907
TIME SPAN: 1001 HRS., FROM: 3-Apr-79 19:91:24 TO: 19-May-79 11:39:59
APPROXIMATE SYSTEM AVAILABILITY: 9.047


Figure 2-6: Sample File/Availability Output from SEADS


[Figure 2-7 listing. Entry types shown: SYSTEM RELOADED, NON-RELOAD MONITOR ERROR, CPU NXM ERROR, DATA CHANNEL ERR, DISK UNIT ERROR, MAGTAPE STATISTICS, KL10 DATA PARITY INTERRUPT, KL10 DATA PARITY TRAP, TOPS20 SYSTEM RELOADED, TOPS20 BUGHLT-BUGCHK, MASSBUS DEVICE ERROR, FRONT END DEVICE REPORT, PROCESSOR PARITY TRAP, PROCESSOR PARITY INTERRUPT, NETCON STARTED, NETWORK DOWN-LINE LOAD, NETWORK UP-LINE DUMP, NETWORK LINE STATS, DN64 STATISTICS. Panel title: DISTRIBUTION BY TIME OF DAY (00:00-23:30)]

Figure 2-7: Sample Time of Day Distribution Output from SEADS
[Figure 2-8 listing. Panel title: DISTRIBUTION OF INTERARRIVAL TIMES. Header values: MEAN TIME: 10.99 HOURS; STANDARD DEVIATION: 15.79 HOURS; MAXIMUM VALUE: 4.00 DAYS; SCALE FACTOR: 3; NUMBER OF ENTRIES: 240]


3. Data Analysis for Transient Errors


3.1 Analyzing Interarrival Times

3.1.1 Distribution of Interarrival Times

The interarrival data can be plotted as a histogram (as in Figures 2-8 and 3-2) to form an approximation to the probability density function of transient errors. This is useful in deciding initially on which distributions to study. The obvious skews toward the low end for all the data collected on these systems indicate that the Weibull distribution should be used. The exponential distribution is a special case of the general Weibull function.

3.1.2 The Weibull Distribution


A review of the Weibull distribution and its related functions is in order. The Weibull distribution has two parameters: α (the shape parameter) and λ (the scale parameter). (Several formulations of the Weibull function appear in the literature [Barl75, Mali75, Roma77, Shoo68, Thom69]. The most common of these are presented in Appendix I.) The probability density function (pdf), cumulative distribution function (CDF), reliability function, and hazard (failure rate) function of the Weibull distribution are shown in Equations 1 through 4 (α > 0, λ > 0):

$$\text{pdf} = f(t) = \alpha \lambda (\lambda t)^{\alpha - 1} \exp[-(\lambda t)^{\alpha}] \tag{1}$$

$$\text{CDF} = F(t) = 1 - \exp[-(\lambda t)^{\alpha}] \tag{2}$$

$$\text{Reliability} = R(t) = \exp[-(\lambda t)^{\alpha}] \tag{3}$$

$$\text{Hazard Function} = z(t) = \alpha \lambda (\lambda t)^{\alpha - 1} \tag{4}$$

Note that the values of all these functions depend on time only through the product of the scale factor and time, λt. Also, the hazard function z(t) shows the effect of the shape parameter on the failure rate:

- if α < 1, the failure rate is decreasing with time;

- if α = 1, the failure rate is constant with time, resulting in an exponential distribution; and

- if α > 1, the failure rate is increasing with time.
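Equations 1 through 4 translate directly into code. The following is a minimal sketch for t > 0; the function names are chosen here for illustration and do not come from the original programs.

```python
import math


def weibull_pdf(t, alpha, lam):
    """Equation 1: probability density function."""
    return alpha * lam * (lam * t) ** (alpha - 1) * math.exp(-((lam * t) ** alpha))


def weibull_cdf(t, alpha, lam):
    """Equation 2: cumulative distribution function."""
    return 1.0 - math.exp(-((lam * t) ** alpha))


def weibull_reliability(t, alpha, lam):
    """Equation 3: reliability function."""
    return math.exp(-((lam * t) ** alpha))


def weibull_hazard(t, alpha, lam):
    """Equation 4: hazard (failure rate) function."""
    return alpha * lam * (lam * t) ** (alpha - 1)


# With a shape parameter below 1 the failure rate falls as time passes;
# with a shape parameter of exactly 1 it stays at the scale parameter.
for t in (1.0, 10.0, 100.0):
    print(t, weibull_hazard(t, 0.5, 0.02), weibull_hazard(t, 1.0, 0.02))
```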

The goal of this study is to find the value of the shape parameter α for the underlying distribution of transient errors.

When performing statistical analysis of data, two common measures are the mean (μ) and standard deviation (σ). For the Weibull distribution, these are defined as follows in terms of α and λ:

$$\mu = \lambda^{-1} \, \Gamma((\alpha + 1)/\alpha) \tag{5}$$

$$\sigma = \lambda^{-1} \, [\Gamma((\alpha + 2)/\alpha) - \Gamma^{2}((\alpha + 1)/\alpha)]^{1/2} \tag{6}$$

where the gamma function is

$$\Gamma(x) = \int_{0}^{\infty} \beta^{x - 1} \exp(-\beta) \, d\beta$$

The influence of the Weibull parameters on the mean of the distribution is illustrated in Figure 3-1. The maximum likelihood estimates of the Weibull parameters for the recorded data are indicated by diamonds in the graph.

With just these two statistics, the Weibull failure rate can be determined to be decreasing, constant, or increasing as follows:

- if μ < σ, the failure rate is decreasing;

- if μ = σ, the failure rate is constant;

- if μ > σ, the failure rate is increasing.
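The mean and standard deviation formulas, and the comparison rule above, can be checked numerically. This is a sketch using Python's `math.gamma`; the function names and the tolerance used to decide equality are assumptions made for illustration.

```python
import math


def weibull_mean(alpha, lam):
    """Mean of the Weibull distribution in terms of shape and scale."""
    return math.gamma((alpha + 1) / alpha) / lam


def weibull_std(alpha, lam):
    """Standard deviation of the Weibull distribution."""
    g1 = math.gamma((alpha + 1) / alpha)
    g2 = math.gamma((alpha + 2) / alpha)
    return math.sqrt(g2 - g1 * g1) / lam


def failure_rate_trend(mean, std):
    """Classify the failure rate from the mean and standard deviation."""
    if math.isclose(mean, std, rel_tol=1e-9):
        return "constant"
    return "decreasing" if mean < std else "increasing"


for alpha in (0.5, 1.0, 2.0):
    m, s = weibull_mean(alpha, 0.02), weibull_std(alpha, 0.02)
    print(alpha, m, s, failure_rate_trend(m, s))
```

In practice the rule is applied to the sample mean and sample standard deviation of the interarrival times, so exact equality is not expected.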


3.1.3  Estimation  of  the  Weibull  Parameters 


In the following equations, {X_1, X_2, ..., X_N} is the set of interarrival times.

The Weibull cumulative distribution function (Equation 2) can be rearranged into the following:

$$\ln \ln [1 / (1 - F(t))] = \alpha \ln(t) + \alpha \ln(\lambda) \tag{7}$$

Equation 7 is the basis for using linear regression analysis to fit the data to a Weibull distribution [Berg74, Roma77]. If the data is from a Weibull distribution, its plot should approximate a straight line. The slope of the straight line is an estimate of α, and the Y-intercept divided by the slope is an estimate of ln(λ). The value of the function F(t) is estimated by

$$F(t_j) = (j - 0.5) / N \tag{8}$$
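The regression procedure built on Equations 7 and 8 can be sketched as an ordinary least squares fit. This is an illustrative implementation, not the authors' program; the function names are hypothetical, and the sample is generated from exact Weibull quantiles so that the points fall on the line.

```python
import math


def weibull_linear_fit(times):
    """Estimate (alpha, lam) by least squares on the Weibull plot."""
    ordered = sorted(times)
    n = len(ordered)
    xs = [math.log(t) for t in ordered]
    # Equation 8 supplies F(t_j); Equation 7 supplies the ordinate.
    ys = [math.log(math.log(1.0 / (1.0 - (j - 0.5) / n)))
          for j in range(1, n + 1)]
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Slope estimates alpha; intercept / slope estimates ln(lam).
    return slope, math.exp(intercept / slope)


def weibull_quantile(u, alpha, lam):
    """Inverse of the Weibull CDF (Equation 2)."""
    return (-math.log(1.0 - u)) ** (1.0 / alpha) / lam


sample = [weibull_quantile((j - 0.5) / 50, 0.7, 0.05) for j in range(1, 51)]
alpha_lin, lam_lin = weibull_linear_fit(sample)
print(alpha_lin, lam_lin)
```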

The maximum likelihood estimators (MLE) α' and λ' for the Weibull distribution satisfy the following equations [Thom69]:

$$\frac{N}{\alpha'} + \sum_{j=1}^{N} \ln(X_j) = N \, \frac{\sum_{j=1}^{N} X_j^{\alpha'} \ln(X_j)}{\sum_{j=1}^{N} X_j^{\alpha'}} \tag{9}$$

$$(\lambda')^{\alpha'} = N \Big/ \sum_{j=1}^{N} X_j^{\alpha'} \tag{10}$$

Once the value of the shape parameter α' is known, Equation 10 can be used to calculate the scale parameter λ'. Equation 9 can be used to derive a difference equation in the form

$$\alpha'_{i+1} = \text{Function}(\alpha'_i, \{X_j\})$$

A quickly converging solution can be found by using the Newton-Raphson method, as was presented in [Thom69]. The linear estimate of α found by the linear regression analysis described above is useful as an initial value for the iterative solution process.
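A Newton-Raphson iteration on Equation 9 can be sketched as follows. This is one possible formulation and not necessarily the one given in [Thom69]: Equation 9 is divided by N and treated as a root-finding problem in the shape parameter, and a simple guard (an assumption added here) keeps the iterate positive. The function names and the synthetic sample are hypothetical.

```python
import math


def weibull_mle(times, alpha0=1.0, tol=1e-10, max_iter=100):
    """Solve Equation 9 for alpha by Newton-Raphson, then apply Equation 10."""
    n = len(times)
    logs = [math.log(x) for x in times]
    mean_log = sum(logs) / n
    alpha = alpha0
    for _ in range(max_iter):
        powers = [x ** alpha for x in times]
        s0 = sum(powers)
        s1 = sum(p * l for p, l in zip(powers, logs))
        s2 = sum(p * l * l for p, l in zip(powers, logs))
        g = 1.0 / alpha + mean_log - s1 / s0      # Equation 9 divided by N
        dg = -1.0 / alpha ** 2 - (s2 * s0 - s1 * s1) / (s0 * s0)
        new_alpha = alpha - g / dg
        if new_alpha <= 0.0:                      # guard: keep the shape positive
            new_alpha = alpha / 2.0
        converged = abs(new_alpha - alpha) < tol
        alpha = new_alpha
        if converged:
            break
    lam = (n / sum(x ** alpha for x in times)) ** (1.0 / alpha)   # Equation 10
    return alpha, lam


def weibull_quantile(u, alpha, lam):
    """Inverse of the Weibull CDF (Equation 2)."""
    return (-math.log(1.0 - u)) ** (1.0 / alpha) / lam


# Synthetic sample built from evenly spaced quantiles (alpha=0.7, lam=0.05).
sample = [weibull_quantile((j - 0.5) / 1000, 0.7, 0.05) for j in range(1, 1001)]
alpha_mle, lam_mle = weibull_mle(sample, alpha0=1.0)
print(alpha_mle, lam_mle)
```

The starting value `alpha0` would normally be the linear regression estimate; 1.0 is used here only to keep the sketch self-contained.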


3.1.4  SEAPLT:  Automated  Weibull  Plot  Analysis 

In order to facilitate the analysis of large amounts of data, the methods described above were implemented in a program named SEAPLT (System Error Analysis PLoTter). Two separate plotting files are output by this program:

- a Weibull Plot showing the linearized fit of the data to the distribution; and

- an adjusted histogram of the interarrival time distribution, with an approximate Weibull probability density function superimposed.

In addition to the plots, both the linear and MLE estimates of α and λ are calculated, along with the mean and standard deviation of the data. All the graphs used in this paper and all the values given in the tables were generated by this program. The procedures for the linear regression analysis and MLE calculations are presented in Appendix II.


3.2  Graphical  Data  Analysis 

System reloads were chosen as likely to reflect transient errors, as they are commonly caused by the ubiquitous "crash". In systems with stable hardware and matured software, the most frequent cause of crashes appears to be transient errors.

In scanning the data generated by SEADS, it became clear that the PDP-10 systems frequently recorded several errors for one fault. To mask out the effects of this, error entries within five minutes of a previous entry were counted as a part of the previous fault. It was felt that five minutes was a reasonable threshold for the study. The software allowed any choice for the threshold, facilitating examination of the sensitivity of the data to threshold values².
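The thresholding step can be sketched as follows. The text does not say whether the five minutes are measured from the first entry of a fault or from the most recent entry; this sketch assumes the latter (entries chain onto the previous entry), and the function name and sample times are hypothetical.

```python
def group_into_faults(entry_times, threshold=300.0):
    """Collapse entries (in seconds) into faults.

    An entry within `threshold` seconds of the previous entry is counted
    as part of the previous fault. Returns the start time of each fault.
    """
    faults = []
    previous = None
    for t in sorted(entry_times):
        if previous is None or t - previous > threshold:
            faults.append(t)
        previous = t
    return faults


# Hypothetical entry times in seconds.
entries = [0.0, 120.0, 400.0, 2000.0, 2100.0, 9000.0]
fault_starts = group_into_faults(entries, threshold=300.0)
interarrivals = [b - a for a, b in zip(fault_starts, fault_starts[1:])]
print(fault_starts, interarrivals)
```

Rerunning with a different `threshold` gives a quick view of how sensitive the interarrival data is to that choice.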

Two groups of data are presented for system reloads. The first is taken from the individual system (TOPSC) which had the most complete data. The second is the collected data from all four systems.

Figures 3-2 and 3-4 show histograms of the distributions of the interarrival times for system reloads on TOPSC and for all four systems, overlaid with the MLE Weibull probability density functions. Figures 3-3 and 3-5 show the plots of the TOPSC and overall PDP-10 reload data on Weibull paper. (The straight lines drawn on all the Weibull plots are least squared error fits to the data.) Note that most of the visual deviation is due to relatively few points at the lower end.

The second class of events likely to reflect transient errors in the PDP-10 data was the memory parity error interrupt. Except in the case of failing devices which cause intermittent, and finally permanent, faults, these are always the result of transient faults in the memory system. Figures 3-6 and 3-7 show the interarrival distribution and the Weibull plot of the data. In this case, only the total data for all four systems is shown, as too few data points were collected from any one of the four systems to be statistically significant.

Figures 3-8 and 3-10 show the adjusted histograms of the interarrivals for Cm* and C.vmp respectively. Figures 3-9 and 3-11 are plots of the interarrival data for each system's transient errors on Weibull paper. The straightness of the data shows that the samples follow a Weibull distribution.

3.3 General Statistics and Confidence Intervals

Table 3-1 lists some general statistics about the interarrival times for the five sets of data: TOPSC reloads, PDP-10 reloads, PDP-10 parity errors, C.vmp crashes, and Cm* transient errors. In all cases, the mean is less than the standard deviation, indicating a decreasing failure rate (α < 1).

                  TOPSC     PDP-10    PDP-10
                  Reload    Reload    Parity    Cm*       C.vmp

Time (hrs)        2646      8576      8596      4222      4921
Errors            195       636       74        103       50
Interarrivals     196       640       78        104       51
μ                 13.5      13.4      110.2     40.6      96.3 (328)
σ                 16.5      24.6      244.9     59.8      167.3 (471)
α (Linear)        0.864     0.684     0.500     0.834     0.711
α' (MLE)          0.826     0.639     0.481     0.779     0.654
λ (Linear)        0.0843    0.109     0.0206    0.0294    0.0149
λ' (MLE)          0.0826    0.106     0.0203    0.0288    0.0146

Table 3-1: Statistics for Transient Errors

A 90% confidence interval for α and λ was generated for the last three sets using methods developed in Thoman et al. [Thom69]. (The 90% confidence interval is the range within which the actual value of the estimated parameter would fall ninety percent of the time if the experiment was repeated many times.) These values are listed in Table 3-2. Note that the range of values for α does not include 1.0 (i.e., the exponential distribution) for any of the three sets of data.

²Threshold values of one minute and ten minutes were also tried without changing the results presented here.


                 [α_low, α_high]    [λ_low, λ_high]

PDP-10 Parity    [0.421, 0.566]     [0.0134, 0.0307]
Cm*              [0.693, 0.893]     [0.0231, 0.0359]
C.vmp            [0.538, 0.806]     [0.0099, 0.0214]

Table 3-2: 90% Confidence Intervals for α and λ


3.4 Goodness of Fit Test

To confirm the impression given by the Weibull plots that the data collected on transient errors for the various systems does in fact fit a Weibull distribution, a Chi-square goodness of fit test was made for each of the five sets of data. Basically, such a test divides the data into several equiprobable regions and measures the deviation from the expected number of samples in each of these regions:

O_i = observed frequency in the i'th region

E_i = expected frequency in the i'th region

When the number of samples is large enough (as is the case for all sets of data under consideration), then the following statistic follows a χ² distribution with K-3 degrees of freedom:

$$Q = \sum_{i=1}^{K} (O_i - E_i)^2 / E_i \tag{11}$$

³Note that the pessimistic value discussed in [Siew78] is used throughout for C.vmp because there were too few interarrivals in the optimistic value (shown in parentheses for the mean and standard deviation) to be statistically significant.
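The statistic of Equation 11 with equiprobable regions can be sketched as below. This is an illustration under stated assumptions, not the test program used for the tables: region boundaries are taken as equal steps of the fitted Weibull CDF, and the function names and synthetic sample are hypothetical. Converting Q to a level of significance requires a χ² table or a statistics library, which is left out here.

```python
import math


def weibull_cdf(t, alpha, lam):
    """Weibull cumulative distribution function (Equation 2)."""
    return 1.0 - math.exp(-((lam * t) ** alpha))


def chi_square_statistic(times, alpha, lam, k):
    """Return (Q, degrees of freedom) for K equiprobable regions."""
    n = len(times)
    expected = n / k                          # E_i is the same for every region
    observed = [0] * k
    for t in times:
        region = min(int(weibull_cdf(t, alpha, lam) * k), k - 1)
        observed[region] += 1
    q = sum((o - expected) ** 2 / expected for o in observed)
    return q, k - 3


def weibull_quantile(u, alpha, lam):
    """Inverse of the Weibull CDF."""
    return (-math.log(1.0 - u)) ** (1.0 / alpha) / lam


# A sample placed exactly on the quantiles of the fitted distribution
# fills every region equally, so Q is zero.
sample = [weibull_quantile((j - 0.5) / 100, 0.7, 0.05) for j in range(1, 101)]
q_fit, dof = chi_square_statistic(sample, 0.7, 0.05, k=10)

# A sample concentrated at one value falls in a single region instead.
q_bad, _ = chi_square_statistic([20.0] * 100, 0.7, 0.05, k=10)
print(q_fit, dof, q_bad)
```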

The results of the tests performed are given in Table 3-3.


                          TOPSC    PDP-10   PDP-10
                          Reload   Reload   Parity   Cm*      C.vmp

Q                         23.36    6.40     6.72     9.46     3.71
Degrees of Freedom d      34       5        11       17       7
Level of Significance p   0.90     0.25     0.80     0.90     0.80
χ²(d, p)                  23.95    6.63     6.99     10.08    3.82

Table 3-3: χ² Goodness of Fit Test Statistics


The level of significance of the Chi-square test is the probability of erroneously rejecting
the hypothesis that the data are from the given distribution. (Statisticians call this a
"Type I" error [DeGr75, Roma77].) Large values are desirable for this figure if they can be
achieved, as the likelihood of wrongly accepting a false hypothesis ("Type II" error) is
inversely related to the probability of Type I errors for fixed sample sizes. Typical values
for the level of significance in statistical tests range from 0.10 to 0.01. All of the results
shown in Table 3-3 are much higher than this, showing very good fits to the Weibull distribution.
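
The significance levels quoted for Table 3-3 can be cross-checked numerically. The sketch below
is an illustration in Python using only the standard library, not part of the original
analysis. It evaluates the upper-tail probability P(X > Q) of a chi-square variable through
the series expansion of the regularized incomplete gamma function. The name `chi2_sf` is an
assumption made for this example.

```python
import math

def chi2_sf(q, dof):
    """Upper-tail probability P(X > q) for a chi-square variable with dof d.o.f."""
    if q <= 0.0:
        return 1.0
    a, x = dof / 2.0, q / 2.0
    # Series for the regularized lower incomplete gamma function P(a, x):
    #   P(a, x) = x^a e^(-x) / Gamma(a) * sum_{n>=0} x^n / (a (a+1) ... (a+n))
    term = 1.0 / a
    total = term
    n = 0
    while abs(term) > 1e-16 * abs(total) and n < 100000:
        n += 1
        term *= x / (a + n)
        total += term
    lower = total * math.exp(-x + a * math.log(x) - math.lgamma(a))
    return max(0.0, 1.0 - lower)

# Test statistics and degrees of freedom as listed in Table 3-3.
for q, d in [(23.36, 34), (6.40, 5), (6.72, 11), (9.46, 17), (3.71, 7)]:
    print(d, q, round(chi2_sf(q, d), 3))
```

A tail probability above the quoted level p means the hypothesis would not be rejected at that
level. The subtraction `1.0 - lower` loses precision when the tail probability is extremely
small, so this sketch is not suited to very large values of Q.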

To complete the testing procedure, a Chi-square test was done for each of the five sets of
data assuming an exponential distribution. A comparison of these results is shown in
Table 3-4. Although the exponential hypothesis fits the data fairly well in a couple of cases,
in every case the Weibull fit is much better.


                           TOPSC     PDP-10    PDP-10
                           Reload    Reload    Parity    Cm*       C.vmp

Test statistic Q           30.61     252.55    79.95     15.14     18.35
Degrees of Freedom d       30
Level of Significance p

Table 3-4: χ² Test of Exponential Distribution
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7 
Figure 3-1: Means of Weibull Distributions

Figure 3-6: Distribution of PDP-10 Parity Interrupts


4. Conclusions

The crash history of C.vmp from August 1977 to May 1978 (4900 hours) shows that the
interarrival times of transient errors follow a Weibull distribution with a shape parameter less
than 1, which indicates a decreasing failure rate. Similarly, analysis of transient error data
collected on Cm* from September 1977 to August 1978 (4200 hours) and data collected from
PDP-10 systems between August 1978 and March 1979 (8600 hours) seems to support the
same conclusion. These systems range in size from an NMOS microprocessor with 12K words
of memory to ECL mainframes with one megaword of memory, and range in redundancy from
nonredundant to parity to triplication. This wide range of application of the decreasing failure
rate Weibull distribution means that much rethinking will have to be done on transient
modelling and statistical analysis of transient data.
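
The link between a shape parameter below 1 and a decreasing failure rate follows from the
hazard function of the Weibull form used in Section 3,
z(t) = f(t) / (1 - F(t)) = αλ(λt)^(α-1), which falls with t whenever α < 1 and is constant
(equal to λ) when α = 1. A minimal Python sketch follows. The parameter values are
illustrative assumptions, not fitted results from this study.

```python
def weibull_hazard(t, alpha, lam):
    """Failure rate z(t) = alpha * lam * (lam*t)**(alpha - 1)."""
    return alpha * lam * (lam * t) ** (alpha - 1.0)

# Illustrative values only: alpha < 1 gives a failure rate that falls over time.
early = weibull_hazard(10.0, 0.7, 0.02)
late = weibull_hazard(1000.0, 0.7, 0.02)

# With alpha = 1 the hazard reduces to the constant rate of the exponential model.
constant = weibull_hazard(10.0, 1.0, 0.02)
```

Here `early` exceeds `late`, while `constant` equals λ at every t.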
I. Forms of the Weibull Distribution

Five main formulations of the Weibull distribution appear in the literature
[Barl75, Mali75, Roma77, Shoo68, Thom69]. The probability density function for each of
these is presented below. Note that the first one presented is the same as that given in
Section 3.

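$f(t; \alpha, \lambda) = \alpha \lambda^{\alpha} t^{\alpha-1} \exp[-(\lambda t)^{\alpha}]$
$f(t; \alpha, \rho) = \alpha \rho \, t^{\alpha-1} \exp[-\rho t^{\alpha}]$
$f(t; \alpha, b) = \alpha b^{-\alpha} t^{\alpha-1} \exp[-(t/b)^{\alpha}]$
$f(t; \alpha, \beta) = (\alpha/\beta) \, t^{\alpha-1} \exp[-(t^{\alpha})/\beta]$
$f(t; m, K) = K t^{m} \exp[-(K/(m+1)) \, t^{m+1}]$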

Following is a table which gives the equivalent parameter values for the different forms of
the Weibull distribution. The first column gives the values of the shape parameter in terms of
the value of α used in Section 3. The second column gives the values of the scale
parameters in terms of λ from the same formulation. The third column gives the value of α in
terms of the different shape parameters and the fourth column gives the value of λ in terms
of the different scale parameters.


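Form                  Shape          Scale                         α          λ

f(t; α, λ)            $\alpha$       $\lambda$                     $\alpha$   $\lambda$
f(t; α, ρ)            $\alpha$       $\lambda^{\alpha}$            $\alpha$   $\rho^{1/\alpha}$
f(t; α, b)            $\alpha$       $\lambda^{-1}$                $\alpha$   $b^{-1}$
f(t; α, β)            $\alpha$       $\lambda^{-\alpha}$           $\alpha$   $\beta^{-1/\alpha}$
f(t; m, K)            $\alpha - 1$   $\alpha \lambda^{\alpha}$     $m + 1$    $(K/(m+1))^{1/(m+1)}$

Table 4-1: Relationships Between Weibull Parameters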
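
The parameter relationships can be checked mechanically. The following Python sketch is an
illustration under the assumption that the five forms are as read above. The dictionary keys
and function names are labels chosen for this example and do not come from the cited
references. It converts the Section 3 parameters (α, λ) into the other parameterizations and
verifies that the hazard-based form f(t; m, K) yields the same density.

```python
import math

def convert_from_section3(alpha, lam):
    """Map (alpha, lam) of f(t; alpha, lam) to the (shape, scale) of each other form."""
    return {
        "power-rate": (alpha, lam ** alpha),           # exp[-scale * t**alpha]
        "scale-b": (alpha, 1.0 / lam),                 # exp[-(t/b)**alpha]
        "beta": (alpha, lam ** -alpha),                # exp[-(t**alpha)/beta]
        "hazard-mK": (alpha - 1.0, alpha * lam ** alpha),  # hazard K * t**m
    }

def section3_from_m_k(m, k):
    """Invert the hazard-based form back to (alpha, lam)."""
    alpha = m + 1.0
    return alpha, (k / alpha) ** (1.0 / alpha)

def pdf_section3(t, alpha, lam):
    return alpha * lam ** alpha * t ** (alpha - 1.0) * math.exp(-((lam * t) ** alpha))

def pdf_m_k(t, m, k):
    return k * t ** m * math.exp(-k * t ** (m + 1.0) / (m + 1.0))

# Illustrative parameter values (assumptions, not results from the tables).
forms = convert_from_section3(0.7, 0.02)
m, k = forms["hazard-mK"]
same_density = math.isclose(pdf_section3(30.0, 0.7, 0.02), pdf_m_k(30.0, m, k))
```

Converting to the (m, K) form and back with `section3_from_m_k` returns the original α and λ up
to rounding error.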
II. Computational Methods


II.1 Linear Regression Analysis

As  explained  earlier,  linear  regression  analysis  is  used  to  fit  a  straight  line  to  the  data  in 
the  various  Weibull  plots.  Following  is  the  code  performing  this  analysis. 


C
C     VARIABLE DECLARATIONS
C
      REAL DELTA                   ! ARRAY OF INTERARRIVAL TIMES
      DIMENSION DELTA(1000)
      INTEGER KSIZE                ! COUNT OF INTERARRIVAL TIMES
      DOUBLE PRECISION SIZE        ! FLOATING VALUE OF KSIZE
      DOUBLE PRECISION X           ! Ln OF INTERARRIVAL TIMES
      DIMENSION X(1000)
      DOUBLE PRECISION SUMX        ! SUM OF X(I)
      DOUBLE PRECISION SUMX2       ! SUM OF X(I)**2
      DOUBLE PRECISION Y           ! TRANSFORMED WEIBULL CDF VALUES
      DIMENSION Y(1000)
      DOUBLE PRECISION SUMY        ! SUM OF Y(I)
      DOUBLE PRECISION SUMXY       ! SUM OF X(I) * Y(I)
      DOUBLE PRECISION A           ! SLOPE OF WEIBULL LINE (Y=A*X+B)
      DOUBLE PRECISION B           ! INTERCEPT OF WEIBULL LINE
      REAL ALPH,LAMB               ! LINEAR ESTIMATES
C
      DOUBLE PRECISION EPSILN      ! ALLOWED ERROR IN CONVERGING
      DOUBLE PRECISION SUMA        ! SUM OF DELTA(I)**ALPHA
      DOUBLE PRECISION SUMB        ! SUM OF DELTA(I)**ALPHA*X(I)
      DOUBLE PRECISION SUMC        ! SUM OF DELTA(I)**ALPHA*X(I)**2
      DOUBLE PRECISION DTEMP       ! WORKING (TEMPORARY) VARIABLE
      DOUBLE PRECISION ALPHA       ! MLE WEIBULL ALPHA
      DOUBLE PRECISION LAMBDA      ! MLE WEIBULL LAMBDA


      . . .


C     SORT THE INPUT DATA
C
 1600 CALL SORT(DELTA,KSIZE)
C
      SIZE = DBLE( FLOAT( KSIZE ) )
      SUMX = 0.0D0
      SUMX2 = 0.0D0
      SUMY = 0.0D0
      SUMXY = 0.0D0
C
      DO 2400 I=1,KSIZE
C
C        PLOT Ln OF DATA ON X-AXIS
C
         X(I) = DLOG(DBLE(DELTA(I)))
C
C        PLOT Ln Ln (1 / (1 - F(t)) ) ON Y-AXIS
C           F(t) IS ESTIMATED BY
C           F(t[I]) = (I - 0.5) / NUMBER OF FAILURES
C
         Y(I) = 1.0D0 / (1.0D0 - ((DBLE(FLOAT(I))-0.5D0)/SIZE))
         Y(I) = DLOG( DLOG( Y(I) ) )
C
C        TRY LINEAR APPROXIMATION USING LEAST SQUARES FIT
C           Y = A*X + B
C
         SUMX = SUMX + X(I)
         SUMX2 = SUMX2 + ( X(I)**2 )
         SUMY = SUMY + Y(I)
         SUMXY = SUMXY + ( X(I) * Y(I) )
C
 2400 CONTINUE
C
      A = ((SIZE*SUMXY)-(SUMX*SUMY)) / ((SIZE*SUMX2)-(SUMX**2))
      B = (SUMY/SIZE) - (A * (SUMX/SIZE) )
      ALPH = SNGL(A)                   ! A = ALPHA
      LAMB = SNGL( DEXP(B/A) )         ! B = ALPHA * Ln(LAMBDA)
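
For readers without a FORTRAN environment, the same least-squares fit can be sketched in
Python. This is a hedged transcription of the listing above, not the original program. The
function name is an assumption, and the relations slope = α and intercept = α ln λ are taken
from the comments in the listing.

```python
import math

def weibull_linear_fit(deltas):
    """Fit a line to the Weibull plot and return the linear estimates (alpha, lam)."""
    data = sorted(deltas)
    n = len(data)
    # X axis: ln of each interarrival time.
    xs = [math.log(t) for t in data]
    # Y axis: ln ln(1 / (1 - F)), with F estimated as (i - 0.5) / n.
    ys = [math.log(math.log(1.0 / (1.0 - (i - 0.5) / n)))
          for i in range(1, n + 1)]
    sum_x = sum(xs)
    sum_x2 = sum(x * x for x in xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    a = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)  # slope
    b = sum_y / n - a * sum_x / n                                  # intercept
    return a, math.exp(b / a)   # alpha = a, lam = exp(b / alpha)

# Usage with synthetic interarrival times that lie exactly on a Weibull line
# (alpha = 0.7 and lam = 0.02 are illustrative values).
n = 50
sample = [(-math.log(1 - (i - 0.5) / n)) ** (1 / 0.7) / 0.02
          for i in range(1, n + 1)]
alpha_hat, lam_hat = weibull_linear_fit(sample)
```

Because the synthetic points fall exactly on the line y = α ln t + α ln λ, the fit recovers the
generating parameters up to rounding error; measured data scatter about the line.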


II.2  MLE  Parameter  Calculation 

The following code segment calculates the MLE values for the Weibull parameters. This
follows the previous example in the same program.
C
C     CALCULATE THE MAXIMUM LIKELIHOOD ESTIMATES OF ALPHA & LAMBDA
C
C        A NEWTON-RAPHSON ITERATION METHOD IS USED TO CALCULATE
C        ALPHA.  (SEE Thoman, et al., "Inferences on the
C        Parameters of the Weibull Distribution", TECHNOMETRICS,
C        August 1969, page 458.)
C
C        THE FORMULA FOR LAMBDA COMES FROM PAGE 445 OF THE SAME
C        PAPER.
C
      EPSILN = 1.0D-6
      ALPHA = A
C
C     EACH PASS RECOMPUTES THE SUMS FOR THE CURRENT ALPHA
C
 2700 SUMA = 0.0D0
      SUMB = 0.0D0
      SUMC = 0.0D0
      DO 2800 I=1,KSIZE
         DTEMP = DEXP(X(I) * ALPHA)
         SUMA = SUMA + DTEMP
         SUMB = SUMB + (DTEMP * X(I))
         SUMC = SUMC + (DTEMP * (X(I)**2))
 2800 CONTINUE
      DTEMP = ALPHA
      ALPHA = DTEMP +
     2   (((1.0D0/DTEMP)+(SUMX/SIZE)-(SUMB/SUMA)) /
     3   ((1.0D0/DTEMP**2)+(((SUMA*SUMC)-SUMB**2)/SUMA**2)))
C
C     REPEAT UNTIL THE RELATIVE CHANGE IN ALPHA IS WITHIN EPSILN
C
      IF ((DABS(ALPHA-DTEMP)/DTEMP) .GT. EPSILN) GO TO 2700
 2900 SUMA = 0.0D0
      DO 3000 I = 1,KSIZE
         DTEMP = DEXP(X(I) * ALPHA)
         SUMA = SUMA + DTEMP
 3000 CONTINUE
      LAMBDA = DEXP( DLOG(SIZE/SUMA) / ALPHA )
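
The Newton-Raphson iteration above can also be sketched in Python. This is an illustrative
transcription under the assumption that the likelihood equation being solved is
1/α + mean(ln t) - Σ t^α ln t / Σ t^α = 0, which is what the update in the listing implies.
The function name, the `max_iter` guard, and the starting value argument are additions for
this example.

```python
import math

def weibull_mle(deltas, alpha0, eps=1e-6, max_iter=100):
    """Newton-Raphson MLE of (alpha, lam) for F(t) = 1 - exp(-(lam*t)**alpha)."""
    xs = [math.log(t) for t in deltas]
    n = len(xs)
    mean_x = sum(xs) / n
    alpha = alpha0                      # the listing starts from the linear estimate
    for _ in range(max_iter):
        w = [math.exp(x * alpha) for x in xs]          # t**alpha
        s_a = sum(w)
        s_b = sum(wi * x for wi, x in zip(w, xs))
        s_c = sum(wi * x * x for wi, x in zip(w, xs))
        g = 1.0 / alpha + mean_x - s_b / s_a           # likelihood equation
        slope = 1.0 / alpha ** 2 + (s_a * s_c - s_b ** 2) / s_a ** 2  # -g'(alpha)
        new_alpha = alpha + g / slope
        converged = abs(new_alpha - alpha) / alpha <= eps
        alpha = new_alpha
        if converged:
            break
    s_a = sum(math.exp(x * alpha) for x in xs)
    lam = math.exp(math.log(n / s_a) / alpha)
    return alpha, lam

# Usage on synthetic data (alpha = 0.7 and lam = 0.02 are illustrative values).
n = 200
sample = [(-math.log(1 - (i - 0.5) / n)) ** (1 / 0.7) / 0.02
          for i in range(1, n + 1)]
alpha_mle, lam_mle = weibull_mle(sample, alpha0=1.0)
```

The estimates land close to, but not exactly on, the generating values, because the maximum
likelihood and least-squares criteria weight the observations differently.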